Javascript Collection, Obfuscation, Crawling?

Hello all,
I am a visiting researcher at a laboratory this summer and my
current task is investigating javascript obfuscation techniques. I am
trying to get a relatively large sample of websites containing
javascript code so I can analyze it and determine if it is:
1) obfuscated
2) malicious

I have a fairly decent inference what the result will be, but it would
be nice to have statistics on my side. Having said that, I believe it
will be necessary to have a very large sample size to perform my
analysis.

Now for my question, does anyone know if there are any ways to utilize
a web browser or other component to automatically find javascript
samples? Google has not yielded any results, and the code search
merely searches repositories; not exactly what I need.

Short of rolling my own crawler, can anyone offer any suggestions that
will aid me in my task?

Thanks!

Jul 24 '07 #1
Steve H. wrote on 24 jul 2007 in comp.lang.javascript:
> Hello all,
> I am a visiting researcher at a laboratory this summer and my
> current task is investigating javascript obfuscation techniques. I am
> trying to get a relatively large sample of websites containing
> javascript code so I can analyze it and determine if it is:
> 1) obfuscated
> 2) malicious

What is the sense of determining if js is obfuscated?

You would first need a decent definition of obfuscation.

Do you really think, or does your employer, that the level of
"obfuscation" is a measure of probability of maliciousness?

The common understanding on this NG is, methinks, that obfuscation only
deters the users that cannot even read plain js.

> I have a fairly decent inference what the result will be, but it would
> be nice to have statistics on my side. Having said that, I believe it
> will be necessary to have a very large sample size to perform my
> analysis.

You should do a random pilot and extrapolate, having determined the
randomness with other parameters. A professional statistician looking
over your shoulder is a must here. Do not throw salt into her eyes.

> Now for my question, does anyone know if there are any ways to utilize
> a web browser or other component to automatically find javascript
> samples? Google has not yielded any results, and the code search
> merely searches repositories; not exactly what I need.
>
> Short of rolling my own crawler, can anyone offer any suggestions that
> will aid me in my task?

Someone has to crawl, and if it's not Google, it must be you, meseems.

Building one is not that difficult, just write a httpxml function.

I would use Google with some simple words to get a number of URLs fast
and measure the amount of bytes between <script and </script> in the
received strings, and check for external .js files.
--
Evertjan.
The Netherlands.
(Please change the x'es to dots in my emailaddress)
Jul 24 '07 #2
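
The measurement suggested above (size of the text between <script and </script>, plus a list of external .js files) could be sketched roughly as follows. This assumes the page has already been fetched into a string. The name scriptStats and both regular expressions are illustrative assumptions, not something from the thread, and a regex scan is only a rough first pass compared with a real HTML parser. It counts characters, which matches bytes only for ASCII content.

```javascript
// Hypothetical helper: scan an already-fetched HTML string for script elements.
function scriptStats(html) {
  // Opening tag with its attributes, then the body up to the closing tag.
  const scriptRe = /<script\b([^>]*)>([\s\S]*?)<\/script\s*>/gi;
  // A src attribute value, quoted or unquoted.
  const srcRe = /\bsrc\s*=\s*["']?([^"'\s>]+)/i;
  let inlineChars = 0;
  const externalFiles = [];
  let match;
  while ((match = scriptRe.exec(html)) !== null) {
    const src = srcRe.exec(match[1]);
    if (src) {
      externalFiles.push(src[1]);      // external script: record its URL
    } else {
      inlineChars += match[2].length;  // inline script: count its characters
    }
  }
  return { inlineChars, externalFiles };
}

const samplePage =
  '<p>hi</p><script type="text/javascript">var a=1;</script>' +
  '<script src="lib.js"></script>';
console.log(scriptStats(samplePage));
```

Keeping the fetch separate from the scan means the same function can be fed pages from any crawler or from files saved earlier.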
On Jul 24, 11:23 am, "Evertjan." <exjxw.hannivo...@interxnl.net>
wrote:
> Steve H. wrote on 24 jul 2007 in comp.lang.javascript:
>
> What is the sense of determining if js is obfuscated?
>
> You would first need a decent definition of obfuscation.
>
> Do you really think, or does your employer, that the level of
> "obfuscation" is a measure of probability of maliciousness?

No, I do not think this, nor does my employer.

> The common understanding on this NG is, methinks, that obfuscation only
> deters the users that cannot even read plain js.

I agree.

>> I have a fairly decent inference what the result will be, but it would
>> be nice to have statistics on my side. Having said that, I believe it
>> will be necessary to have a very large sample size to perform my
>> analysis.
>
> You should do a random pilot and extrapolate, having determined the
> randomness with other parameters. A professional statistician looking
> over your shoulder is a must here. Do not throw salt into her eyes.

This is a bit assuming, but thank you for the suggestion. Let's just
say that there are enough people in my vicinity to verify my results
and ensure that I perform statistical tests properly. Having said that,
I am no stranger to the field.

>> Now for my question, does anyone know if there are any ways to utilize
>> a web browser or other component to automatically find javascript
>> samples? Google has not yielded any results, and the code search
>> merely searches repositories; not exactly what I need.
>> Short of rolling my own crawler, can anyone offer any suggestions that
>> will aid me in my task?
>
> Someone has to crawl, and if it's not Google, it must be you, meseems.
>
> Building one is not that difficult, just write a httpxml function.

I wasn't really concerned with difficulty; I was just wondering if
someone knew of a method to save me some time. I am currently juggling
multiple projects and this one is a little lower in priority than
others.

> I would use Google with some simple words to get a number of URLs fast
> and measure the amount of bytes between <script and </script> in the
> received strings, and check for external .js files.

I will probably write my own crawler in conjunction with the Google
API.

Thank you again for your suggestions, but I found many of your
statements assuming and/or loaded. I wish you would have asked me
questions for clarification without introducing a bias into the way
you ask said questions; personally, I find that a bit insulting.

--
Steve

Jul 24 '07 #3
Steve H. said the following on 7/24/2007 2:02 PM:
> Hello all,
> I am a visiting researcher at a laboratory this summer and my
> current task is investigating javascript obfuscation techniques. I am
> trying to get a relatively large sample of websites containing
> javascript code so I can analyze it and determine if it is:
> 1) obfuscated
> 2) malicious

The only effective way to know either of those is to sit and study the
code manually and determine whether it is obfuscated and/or malicious.
There is no telltale sign as to whether code is obfuscated or
malicious. I assume you will be doing that yourself? Also, you would
need a pretty good understanding of JS to know what is malicious or not
(and it depends on your definition of malicious).

Is this code malicious?

<script type="text/javascript">
function closeTheWindow(){
  self.close();
}
</script>

Personally, I find it malicious as there is no good use for it, but it
does nothing "malicious" to the user's computer.

As for obfuscated code, if a site has both obfuscated and plain code,
does it count as obfuscated or not?

> I have a fairly decent inference what the result will be, but it would
> be nice to have statistics on my side. Having said that, I believe it
> will be necessary to have a very large sample size to perform my
> analysis.

A simple Google search for the four words "and", "but", "or" and "the"
returns 14,140,000,000 pages. Ironically, if I add "OR usenet" it lowers
the results when it would stand to reason that it should leave them the
same or increase them. Gotta love Google.

<URL: http://www.google.com/search?hl=en&lr=&as_qdr=all&q=+and+OR+but+OR+or+OR+the>

How much larger a sample do you want? I think that search would be fairly
indicative of the web in general as it doesn't skew the results towards
any particular thing other than English pages.
> Now for my question, does anyone know if there are any ways to utilize
> a web browser or other component to automatically find javascript
> samples? Google has not yielded any results, and the code search
> merely searches repositories; not exactly what I need.

You could use a browser to retrieve the pages as a crawler, but the task
of determining whether a page has script in it or not, across a very
large number of pages, would best be left to some other program. An XHR
request running locally doesn't suffer from a cross-domain issue, so you
could simply feed an IE page a million or so URLs and have it retrieve
each, search it for script, and log the results. You would still have to
go back and review the script pages manually, though. 3 seconds to do a
page is being very generous for the time it would take to retrieve the
page, search its contents, log the results, read another URL and issue a
request for it. At 3 seconds, a simple million pages would take you 800+
hours to machine process. Then the time of manually processing
those 1 million pages gets astronomical. Make it 14 billion pages and
your grandchildren wouldn't get it finished with one computer.
> Short of rolling my own crawler, can anyone offer any suggestions that
> will aid me in my task?

While I am curious about some hard analysis to see whether scripting
itself is as prevalent as I think it is, along with malicious/obfuscated
code, without a very large sample (above a billion pages) the results
would have to be skewed in one direction or the other, and in the end
that makes those statistics useless for a real-world observation.

--
Randy
Chance Favors The Prepared Mind
comp.lang.javascript FAQ - http://jibbering.com/faq/index.html
Javascript Best Practices - http://www.JavascriptToolbox.com/bestpractices/
Jul 25 '07 #4
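
The time estimate in the post above can be checked with a few lines of arithmetic. The figures (3 seconds per page, one million pages, 14,140,000,000 pages) come from the post. The helper name crawlHours, the 365-day year and the assumption of a single machine working strictly sequentially are illustrative choices, not part of the thread.

```javascript
// Figures taken from the post above; the helper name is hypothetical.
const SECONDS_PER_PAGE = 3;
const SECONDS_PER_HOUR = 3600;
const HOURS_PER_YEAR = 24 * 365; // assumes a 365-day year, one machine, no parallelism

// Hours of machine time needed to fetch and scan `pages` pages one after another.
function crawlHours(pages, secondsPerPage = SECONDS_PER_PAGE) {
  return (pages * secondsPerPage) / SECONDS_PER_HOUR;
}

const millionPageHours = crawlHours(1000000);
const googleCountYears = crawlHours(14140000000) / HOURS_PER_YEAR;

console.log(millionPageHours.toFixed(0) + " hours for one million pages");
console.log(googleCountYears.toFixed(0) + " years for 14.14 billion pages");
```

One million pages works out to about 833 hours (roughly 35 days), which agrees with the "800+ hours" figure, and the full Google count comes to roughly 1,345 years on one machine. Running requests in parallel would cut the wall-clock time but not the manual review effort.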
On Jul 25, 4:02 am, "Steve H." <steve.c.ha...@gmail.com> wrote:
> Hello all,
> I am a visiting researcher at a laboratory this summer and my
> current task is investigating javascript obfuscation techniques. I am
> trying to get a relatively large sample of websites containing
> javascript code so I can analyze it and determine if it is:
> 1) obfuscated
> 2) malicious
>
> I have a fairly decent inference what the result will be, but it would
> be nice to have statistics on my side. Having said that, I believe it
> will be necessary to have a very large sample size to perform my
> analysis.
>
> Now for my question, does anyone know if there are any ways to utilize
> a web browser or other component to automatically find javascript
> samples? Google has not yielded any results, and the code search
> merely searches repositories; not exactly what I need.
>
> Short of rolling my own crawler, can anyone offer any suggestions that
> will aid me in my task?

Detecting obfuscated code should be fairly straightforward: look for
the patterns:

function <identifier>
var <identifier>

and compare the amount of white space to character data. If the
average length of identifiers is short (say 2 characters) and the
percentage of white space is very low (say less than 5%, testing will
tell), the code is likely obfuscated.

I don't know if you intend to attribute any particular motive to
obfuscation, but when used to minimize identifier lengths and remove
all unnecessary white space (i.e. minification) it can seriously
reduce the size of scripts, providing the benefits of faster downloads
and lower data volume. The fact that obfuscated code is also (very)
difficult to read is seen as a bonus by some, though it should not be
the primary purpose for using it.

For example, Google's map scripts are (or were, I haven't checked
lately) obfuscated, yet within a very short time manually
'de-obfuscated' versions appeared on the web, published by those who
wanted to share how it worked. I expect Google wasn't concerned about
that as they were likely after the minification benefits rather than
attempting to protect their copyright.

As for malicious code, I think you need to know exactly what you are
looking for, e.g. the recently publicised IE and Firefox protocol
handling flaw or the supposed iPhone vulnerability. I think
javascript might be used as a transport to, say, deliver a malicious
object (say an applet, animation or image), but it is unlikely that the
script itself will be malicious.
--
Rob

Jul 25 '07 #5
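
The heuristic described above (short declared identifiers plus a very low share of white space) might be sketched like this. The function name obfuscationHints and its return shape are assumptions made for illustration. The default thresholds of 2 characters and 5% are the "say" values from the post and would need tuning on real samples. The regex also matches declarations that happen to sit inside strings or comments, so treat the numbers as rough indicators only.

```javascript
// Hypothetical helper implementing the identifier-length and white-space checks.
function obfuscationHints(source, maxAvgLength = 2, maxWhitespaceRatio = 0.05) {
  // Names introduced by `function <identifier>` or `var <identifier>`.
  const declRe = /\b(?:function|var)\s+([A-Za-z_$][\w$]*)/g;
  const names = [];
  let m;
  while ((m = declRe.exec(source)) !== null) {
    names.push(m[1]);
  }
  const totalLength = names.reduce((sum, name) => sum + name.length, 0);
  const avgIdentifierLength = names.length === 0 ? 0 : totalLength / names.length;

  // Share of all characters that are white space.
  const whitespaceCount = (source.match(/\s/g) || []).length;
  const whitespaceRatio = source.length === 0 ? 0 : whitespaceCount / source.length;

  const likelyMinified =
    names.length > 0 &&
    avgIdentifierLength <= maxAvgLength &&
    whitespaceRatio < maxWhitespaceRatio;

  return { avgIdentifierLength, whitespaceRatio, likelyMinified };
}

console.log(obfuscationHints("function computeTotal(items) {\n  var runningTotal = 0;\n}"));
```

Very short scripts tend to fail the white-space test even when minified, because the few mandatory spaces (after `var`, `function`, `return`) make up a large share of the text, so the check is more meaningful on scripts of some length.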
RobG said the following on 7/24/2007 10:32 PM:
> Detecting obfuscated code should be fairly straightforward: look for
> the patterns:
>
> function <identifier>
> var <identifier>

That pattern exists in minified code but very seldom appears in the raw
source of obfuscated code, much of which starts with a pattern similar to this:

var x = ".............."
eval(x)

--
Randy
Chance Favors The Prepared Mind
comp.lang.javascript FAQ - http://jibbering.com/faq/index.html
Javascript Best Practices - http://www.JavascriptToolbox.com/bestpractices/
Jul 25 '07 #6
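
A rough check for the "long string handed to eval" pattern shown above could look like the following. The function name looksEvalPacked and the 200-character cutoff are assumptions picked for illustration; the thread gives no threshold. The string-literal regex is naive (it ignores template literals and can be fooled by quotes inside comments), so this is a sketch of the idea and not a reliable detector.

```javascript
// Hypothetical helper: flag scripts that call eval and carry a long string literal.
function looksEvalPacked(source, minLiteralLength = 200) {
  const hasEval = /\beval\s*\(/.test(source);

  // Single- or double-quoted literals on one line, allowing backslash escapes.
  const literalRe = /"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*'/g;
  let longestLiteral = 0;
  let m;
  while ((m = literalRe.exec(source)) !== null) {
    // Subtract the two quote characters to get the payload length.
    longestLiteral = Math.max(longestLiteral, m[0].length - 2);
  }

  return {
    hasEval,
    longestLiteral,
    suspicious: hasEval && longestLiteral >= minLiteralLength,
  };
}

console.log(looksEvalPacked('var greeting = "hello"; alert(greeting);'));
```

Combining this with the identifier and white-space measures gives two independent signals, one aimed at minified code and one aimed at packed code.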
Steve H. wrote on 24 jul 2007 in comp.lang.javascript:
> Thank you again for your suggestions, but I found many of your
> statements assuming and/or loaded. I wish you would have asked me
> questions for clarification without introducing a bias into the way
> you ask said questions; personally, I find that a bit insulting.

You were on the asking side, providing not even enough info about your
own presumed qualities, so if you want only niceties, try a paid
helpdesk.

This is usenet, so get used to it, Steve.

>> You should do a random pilot and extrapolate, having determined the
>> randomness with other parameters. A professional statistician looking
>> over your shoulder is a must here. Do not throw salt into her eyes.
>
> This is a bit assuming, but thank you for the suggestion. Let's just
> say that there are enough people in my vicinity to verify my results
> and ensure that I perform statistical tests properly. Having said that,
> I am no stranger to the field.

Again, how could we know you are "no stranger to the field" of
statistics?

In the medical field, where I work, checking your own research statistics
is rightly felt to introduce hidden biases.

>> Do you really think, or does your employer, that the level of
>> "obfuscation" is a measure of probability of maliciousness?
>
> No, I do not think this, nor does my employer.

So why are you [plural] searching for obfuscation at all, if,
as I surmise, you are after malicious code on the web?

===

I think a properly conducted pilot, in the statistical sense, will give
you a reasonable idea about the computer time involved to find enough of
the code you are after. Perhaps the main enterprise would take 12 years,
or 2 hours of computer time, who is to say without a pilot? And even then
extrapolation, the standard goal of a pilot, remains dangerous as some
hidden timing effect could act exponentially or the pilot's URL batch
could prove to be non-representative on a larger scale.

--
Evertjan.
The Netherlands.
(Please change the x'es to dots in my emailaddress)
Jul 25 '07 #7
RobG wrote:
> I don't know if you intend to attribute any particular motive to
> obfuscation, but when used to minimize identifier lengths and remove
> all unnecessary white space (i.e. minification) it can seriously
> reduce the size of scripts, providing the benefits of faster downloads
> and lower data volume. [...]

I wouldn't be so sure about that. For example, omitting white space
characters tends to require delimiter characters that were otherwise not
needed.

PointedEars
--
Prototype.js was written by people who don't know javascript for people
who don't know javascript. People who don't know javascript are not the
best source of advice on designing systems that use javascript.
-- Richard Cornford, <f8*******************@news.demon.co.uk>
Aug 1 '07 #8
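
One way to see the delimiter point is with line breaks, which JavaScript accepts as statement separators. The sketch below is an illustration chosen for this note and not an example from the thread; the helper name parses is hypothetical. It uses the Function constructor only to compile a function body without running it, which may be blocked on pages with a strict Content Security Policy.

```javascript
// Hypothetical helper: true if `body` compiles as a function body.
function parses(body) {
  try {
    new Function(body); // compiles the body without executing it
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// Three statements separated only by line breaks.
const original = "var a = 1\nvar b = 2\nreturn a + b";
// Dropping the line breaks for spaces leaves nothing to separate the statements.
const naive = original.replace(/\n/g, " ");
// Replacing each line break with a semicolon keeps the code valid.
const delimited = original.replace(/\n/g, ";");

console.log(parses(original), parses(naive), parses(delimited));
console.log(original.length, delimited.length);
```

Here each removed line break has to be paid back with a semicolon, so that particular white space saves nothing. The savings from minification come mostly from indentation, comments and shortened names.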
On Jul 24, 7:32 pm, RobG <rg...@iinet.net.au> wrote:
> Detecting obfuscated code should be fairly straightforward: look for
> the patterns:
>
> function <identifier>
> var <identifier>
>
> and compare the amount of white space to character data. If the
> average length of identifiers is short (say 2 characters) and the
> percentage of white space is very low (say less than 5%, testing will
> tell), the code is likely obfuscated.
>
> I don't know if you intend to attribute any particular motive to
> obfuscation, but when used to minimize identifier lengths and remove
> all unnecessary white space (i.e. minification) it can seriously
> reduce the size of scripts, providing the benefits of faster downloads
> and lower data volume. The fact that obfuscated code is also (very)
> difficult to read is seen as a bonus by some, though it should not be
> the primary purpose for using it.
>
> For example, Google's map scripts are (or were, I haven't checked
> lately) obfuscated, yet within a very short time manually
> 'de-obfuscated' versions appeared on the web, published by those who
> wanted to share how it worked. I expect Google wasn't concerned about
> that as they were likely after the minification benefits rather than
> attempting to protect their copyright.
>
> As for malicious code, I think you need to know exactly what you are
> looking for, e.g. the recently publicised IE and Firefox protocol
> handling flaw or the supposed iPhone vulnerability. I think
> javascript might be used as a transport to, say, deliver a malicious
> object (say an applet, animation or image), but it is unlikely that the
> script itself will be malicious.
>
> --
> Rob

Also, another test for obfuscation may be to check if there are any
comments in the script. Comments are usually removed from the source
in compressed/obfuscated code.
Aug 1 '07 #9
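
The comment check suggested above might be sketched as below. The function name commentStats is an assumption for illustration. The scan is deliberately simple: it skips `//` that directly follows a colon so that URLs such as http://... are not counted, but it can still be fooled by comment-like text inside string or regex literals. A tokenizer would be needed for accurate counts.

```javascript
// Hypothetical helper: count block and line comments in a script.
function commentStats(source) {
  const blockRe = /\/\*[\s\S]*?\*\//g;
  const blockComments = source.match(blockRe) || [];

  // Remove block comments first so a "//" inside them is not counted twice.
  const withoutBlocks = source.replace(blockRe, "");

  // "//" at the start of a line or after any character except ":".
  const lineComments = withoutBlocks.match(/(^|[^:])\/\/.*$/gm) || [];

  const commentCount = blockComments.length + lineComments.length;
  return { commentCount, hasComments: commentCount > 0 };
}

console.log(commentStats("// add two numbers\nfunction add(a, b) {\n  /* sum */\n  return a + b;\n}"));
```

A script with no comments at all is only weak evidence on its own, since plenty of hand-written scripts are uncommented, so this fits best as one signal alongside the other measures in the thread.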
